In the previous lab, you explored the automotive price dataset to understand the relationships for a regression problem. In this lab you will explore the German bank credit dataset to understand the relationships for a classification problem. The difference is that in classification problems the label is a categorical variable.
In later labs you will use what you learn through visualization to create a solution that predicts which customers have bad credit. For now, the focus of this lab is on visually exploring the data to determine which features may be useful in predicting customers with bad credit.
Visualization for classification problems shares much in common with visualization for regression problems. Collinear features should be identified so they can be eliminated or otherwise dealt with. However, for classification problems you are looking for features that help separate the label categories. Separation is achieved when there are distinctive feature values for each label category. Good separation results in a low classification error rate.
As a first step you must load the dataset.
Execute the code in the cell below to load the packages required for the rest of this notebook.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import numpy.random as nr
import math
%matplotlib inline
The code in the cell below loads the dataset and assigns human-readable names to the columns. The shape and head of the data frame are then printed. Execute this code:
credit = pd.read_csv('German_Credit.csv', header=None)
credit.columns = ['customer_id',
'checking_account_status', 'loan_duration_mo', 'credit_history',
'purpose', 'loan_amount', 'savings_account_balance',
'time_employed_yrs', 'payment_pcnt_income','gender_status',
'other_signators', 'time_in_residence', 'property', 'age_yrs',
'other_credit_outstanding', 'home_ownership', 'number_loans',
'job_category', 'dependents', 'telephone', 'foreign_worker',
'bad_credit']
print(credit.shape)
credit.head()
There are 1000 rows and 22 columns in the dataset. The first column is customer_id, which is an identifier, not a feature, so we will drop it.
credit.drop(['customer_id'], axis=1, inplace=True)
print(credit.shape)
credit.head()
Now, there are 21 columns left. Of the 21 columns, there are 20 features plus a label column. These features represent information a bank might have on its customers. There are both numeric and categorical features. However, the categorical features are coded in a way that makes them hard to understand. Further, the label is coded as $\{ 1,2 \}$ which is a bit awkward.
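If you want to confirm which columns pandas treats as numeric and which as categorical (object) before recoding, you can split them with select_dtypes. This is an optional check, sketched here on a small illustrative frame (the column names mirror the dataset, but the values are invented):

```python
import pandas as pd

# Small illustrative frame with one numeric and one coded categorical column
demo = pd.DataFrame({'loan_amount': [1000, 2500],
                     'purpose': ['A40', 'A41']})

# select_dtypes partitions the columns by dtype
numeric_cols = demo.select_dtypes(include='number').columns.tolist()
object_cols = demo.select_dtypes(include='object').columns.tolist()
print(numeric_cols)  # ['loan_amount']
print(object_cols)   # ['purpose']
```

Running the same check on the full `credit` data frame shows which of the 20 features are numeric and which will need the recoding performed below.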
The code in the cell below uses a list of dictionaries to recode the categorical features with human-readable text. The final dictionary in the list recodes good and bad credit as a binary variable, $\{ 0,1 \}$. The for loop iterates over the columns and maps codes to human-readable category names. Human-readable coding of data greatly improves people's ability to understand the relationships in the data.
Execute this code and examine the result:
code_list = [['checking_account_status',
{'A11' : '< 0 DM',
'A12' : '0 - 200 DM',
'A13' : '> 200 DM or salary assignment',
'A14' : 'none'}],
['credit_history',
{'A30' : 'no credit - paid',
'A31' : 'all loans at bank paid',
'A32' : 'current loans paid',
'A33' : 'past payment delays',
'A34' : 'critical account - other non-bank loans'}],
['purpose',
{'A40' : 'car (new)',
'A41' : 'car (used)',
'A42' : 'furniture/equipment',
'A43' : 'radio/television',
'A44' : 'domestic appliances',
'A45' : 'repairs',
'A46' : 'education',
'A47' : 'vacation',
'A48' : 'retraining',
'A49' : 'business',
'A410' : 'other' }],
['savings_account_balance',
{'A61' : '< 100 DM',
'A62' : '100 - 500 DM',
'A63' : '500 - 1000 DM',
'A64' : '>= 1000 DM',
'A65' : 'unknown/none' }],
['time_employed_yrs',
{'A71' : 'unemployed',
'A72' : '< 1 year',
'A73' : '1 - 4 years',
'A74' : '4 - 7 years',
'A75' : '>= 7 years'}],
['gender_status',
{'A91' : 'male-divorced/separated',
'A92' : 'female-divorced/separated/married',
'A93' : 'male-single',
'A94' : 'male-married/widowed',
'A95' : 'female-single'}],
['other_signators',
{'A101' : 'none',
'A102' : 'co-applicant',
'A103' : 'guarantor'}],
['property',
{'A121' : 'real estate',
'A122' : 'building society savings/life insurance',
'A123' : 'car or other',
'A124' : 'unknown-none' }],
['other_credit_outstanding',
{'A141' : 'bank',
'A142' : 'stores',
'A143' : 'none'}],
['home_ownership',
{'A151' : 'rent',
'A152' : 'own',
'A153' : 'for free'}],
['job_category',
{'A171' : 'unemployed-unskilled-non-resident',
'A172' : 'unskilled-resident',
'A173' : 'skilled',
'A174' : 'highly skilled'}],
['telephone',
{'A191' : 'none',
'A192' : 'yes'}],
['foreign_worker',
{'A201' : 'yes',
'A202' : 'no'}],
['bad_credit',
{2 : 1,
1 : 0}]]
for col_dic in code_list:
    col = col_dic[0]   # column name
    dic = col_dic[1]   # dictionary mapping codes to category names
    credit[col] = [dic[x] for x in credit[col]]
credit.head()
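As an aside, the same recoding can be written more idiomatically with pandas' Series.map, which replaces each code with its dictionary value (and yields NaN for codes missing from the dictionary, whereas the list comprehension raises a KeyError). A minimal sketch on a toy column using the foreign_worker dictionary:

```python
import pandas as pd

# Dictionary from the recoding list above
fw = {'A201': 'yes', 'A202': 'no'}

# Toy column of raw codes
s = pd.Series(['A201', 'A202', 'A201'])

# map() substitutes each code with its human-readable value
recoded = s.map(fw)
print(recoded.tolist())  # ['yes', 'no', 'yes']
```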
The categorical features now have meaningful coding. Additionally, the label is now coded as a binary variable.
In this case, the label has significant class imbalance. Class imbalance means that there are unequal numbers of cases for the categories of the label. Class imbalance can seriously bias the training of classifier algorithms. In many cases, the imbalance leads to a higher error rate for the minority class. Most real-world classification problems have class imbalance, sometimes severe class imbalance, so it is important to test for this before training any model.
Fortunately, it is easy to test for class imbalance using a frequency table. Execute the code in the cell below to display a frequency table of the classes:
credit_counts = credit['bad_credit'].value_counts()
print(credit_counts)
Notice that only 30% of the cases have bad credit. This is not surprising, since a bank would typically retain customers with good credit. While this is not a case of severe imbalance, it is enough to bias the training of any model.
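The class proportions can also be computed directly by normalizing the frequency table. A sketch with illustrative labels (7 good cases for every 3 bad, mirroring the roughly 70/30 split in the dataset):

```python
import pandas as pd

# Illustrative label column: 0 = good credit, 1 = bad credit
labels = pd.Series([0] * 7 + [1] * 3, name='bad_credit')

# normalize=True returns proportions instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions.loc[0])  # 0.7
print(proportions.loc[1])  # 0.3
```

Applying `value_counts(normalize=True)` to `credit['bad_credit']` gives the class proportions for the full dataset.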
As stated previously, the primary goal of visualization for classification problems is to understand which features are useful for class separation. In this section, you will start by visualizing the separation quality of numeric features.
Execute the code, examine the results, and answer Question 1 on the course page.
def plot_box(credit, cols, col_x = 'bad_credit'):
    for col in cols:
        sns.set_style("whitegrid")
        sns.boxplot(x=col_x, y=col, data=credit)
        plt.xlabel(col_x)  # Set text for the x axis
        plt.ylabel(col)    # Set text for the y axis
        plt.show()
num_cols = ['loan_duration_mo', 'loan_amount', 'payment_pcnt_income',
'age_yrs', 'number_loans', 'dependents']
plot_box(credit, num_cols)
How can you interpret these results? Box plots are useful since, by their very construction, you are forced to focus on the overlap (or not) of the quartiles of the distribution. In this case, the question is whether the quartiles differ enough for the feature to be useful in separating the label classes. The following cases are displayed in the above plots:
As an alternative to box plots, you can use violin plots to examine the separation of label cases by numeric features. Execute the code in the cell below and examine the results:
def plot_violin(credit, cols, col_x = 'bad_credit'):
    for col in cols:
        sns.set_style("whitegrid")
        sns.violinplot(x=col_x, y=col, data=credit)
        plt.xlabel(col_x)  # Set text for the x axis
        plt.ylabel(col)    # Set text for the y axis
        plt.show()
plot_violin(credit, num_cols)
The interpretation of these plots is largely the same as the box plots. However, there is one detail worth noting. The differences between loan_duration_mo and loan_amount for good and bad credit customers appear only at the more extreme values. It may be that these features are less useful than the box plots indicate.
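The quartile overlap that the box and violin plots show visually can also be checked numerically with a grouped quantile, which makes the comparison explicit. A sketch on toy data (the column names mirror the dataset, but the values are invented):

```python
import pandas as pd

# Toy data: loan durations for good (0) and bad (1) credit cases
toy = pd.DataFrame({
    'bad_credit':       [0, 0, 0, 0, 1, 1, 1, 1],
    'loan_duration_mo': [6, 12, 18, 24, 18, 24, 36, 48],
})

# Quartiles of loan duration for each label class
quartiles = toy.groupby('bad_credit')['loan_duration_mo'].quantile([0.25, 0.5, 0.75])
print(quartiles)
```

The further apart the quartiles of the two classes are, the better the feature is likely to separate the label; here the toy medians are 15 and 30 months. The same pattern applied to `credit` and `num_cols` quantifies what the plots show.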
Now you will turn to the problem of visualizing the ability of categorical features to separate classes of the label. Ideally, a categorical feature will have very different counts of the categories for each of the label values. A good way to visualize these relationships is with bar plots.
The code in the cell below creates side-by-side plots of the categorical variables for each of the label categories.
Execute this code, examine the results, and answer Question 2 on the course page.
cat_cols = ['checking_account_status', 'credit_history', 'purpose', 'savings_account_balance',
'time_employed_yrs', 'gender_status', 'other_signators', 'property',
'other_credit_outstanding', 'home_ownership', 'job_category', 'telephone',
'foreign_worker']
credit['dummy'] = np.ones(shape = credit.shape[0])
for col in cat_cols:
    print(col)
    counts = credit[['dummy', 'bad_credit', col]].groupby(['bad_credit', col], as_index = False).count()
    _ = plt.figure(figsize = (10,4))
    plt.subplot(1, 2, 1)
    temp = counts[counts['bad_credit'] == 0][[col, 'dummy']]
    plt.bar(temp[col], temp.dummy)
    plt.xticks(rotation=90)
    plt.title('Counts for ' + col + '\n Good credit')
    plt.ylabel('count')
    plt.subplot(1, 2, 2)
    temp = counts[counts['bad_credit'] == 1][[col, 'dummy']]
    plt.bar(temp[col], temp.dummy)
    plt.xticks(rotation=90)
    plt.title('Counts for ' + col + '\n Bad credit')
    plt.ylabel('count')
    plt.show()
There is a lot of information in these plots. The key to interpreting these plots is comparing the proportion of the categories for each of the label values. If these proportions are distinctly different for each label category, the feature is likely to be useful in separating the label.
There are several cases evident in these plots:
Notice that only a few of these categorical features will be useful in separating the cases.
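The proportion comparison described above can be made precise with pd.crosstab, normalizing within each label class so the two categorical distributions are directly comparable. A sketch with invented values for one feature:

```python
import pandas as pd

# Toy data: checking account status for good (0) and bad (1) credit cases
toy = pd.DataFrame({
    'bad_credit':              [0, 0, 0, 1, 1, 1],
    'checking_account_status': ['none', 'none', '< 0 DM', '< 0 DM', '< 0 DM', 'none'],
})

# normalize='index' makes each row (label class) sum to 1.0
props = pd.crosstab(toy['bad_credit'], toy['checking_account_status'],
                    normalize='index')
print(props)
```

A feature whose rows look alike offers little separation; rows with clearly different proportions, as in this toy example, suggest a useful feature. The same call on `credit` with each column in `cat_cols` quantifies what the bar plots show.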
In this lab you have performed exploration and visualization to understand the relationships in a classification dataset. Specifically: